
[CASCL-1304] Snapshot node managers in ConfigMap during install #2945

Merged
L3n41c merged 4 commits into main from lenaic/CASCL-1304-list-node-manager
Apr 28, 2026

Conversation

@L3n41c
Member

@L3n41c L3n41c commented Apr 27, 2026

What does this PR do?

At the very end of kubectl datadog autoscaling cluster install, classify every node by its current management method (Fargate / Karpenter / EKS managed node group / EC2 ASG / standalone / unknown) and persist the snapshot as YAML in a dd-cluster-info ConfigMap in the Karpenter namespace. The command also detects whether a legacy cluster-autoscaler Deployment is running (and where) and records it as a sibling field in the same ConfigMap.

The classification follows the decision tree documented in the Confluence page EKS Node Management Identification Guide — Signals, Tags and Datadog Visibility.

ConfigMap shape:

apiVersion: v1
clusterName: ...
generatedAt: ...
nodeManagement:
  fargate:
    <fargate-profile>: [<node-name>...]
  karpenter:
    <nodepool>: [...]
  eksManagedNodeGroup:
    <nodegroup>: [...]
  asg:
    <asg-name>: [...]
  standalone:
    "": [...]
  unknown:
    "<providerID>": [...]
clusterAutoscaler:
  present: true
  namespace: kube-system
  name: cluster-autoscaler
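
For illustration, a minimal sketch of the per-node classification behind the nodeManagement buckets above, assuming the standard EKS and Karpenter node labels described in the guide; the function name, the NodeManager constants, and the asgByInstanceID map are illustrative and not necessarily the actual implementation:

package clusterinfo

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// NodeManager is one of the buckets shown in the ConfigMap shape above.
type NodeManager string

const (
    ManagerFargate             NodeManager = "fargate"
    ManagerKarpenter           NodeManager = "karpenter"
    ManagerEKSManagedNodeGroup NodeManager = "eksManagedNodeGroup"
    ManagerASG                 NodeManager = "asg"
    ManagerStandalone          NodeManager = "standalone"
    ManagerUnknown             NodeManager = "unknown"
)

// classifyNode walks the decision tree: Fargate and Karpenter are identified
// by node labels, EKS managed node groups by the eks.amazonaws.com/nodegroup
// label, and the ASG bucket is resolved from the EC2 instance ID via a
// pre-built instance-ID -> ASG-name map (see the batching sketch further
// down). The second return value is the grouping key used in the ConfigMap.
func classifyNode(node corev1.Node, asgByInstanceID map[string]string) (NodeManager, string) {
    labels := node.Labels
    if labels["eks.amazonaws.com/compute-type"] == "fargate" {
        // Resolving the Fargate profile name is out of scope for this sketch
        // (the manual e2e output below shows "" when it cannot be resolved).
        return ManagerFargate, ""
    }
    if pool, ok := labels["karpenter.sh/nodepool"]; ok {
        return ManagerKarpenter, pool
    }
    if ng, ok := labels["eks.amazonaws.com/nodegroup"]; ok {
        return ManagerEKSManagedNodeGroup, ng
    }
    // providerID looks like aws:///eu-west-3a/i-0123456789abcdef0.
    parts := strings.Split(node.Spec.ProviderID, "/")
    instanceID := parts[len(parts)-1]
    if !strings.HasPrefix(instanceID, "i-") {
        return ManagerUnknown, node.Spec.ProviderID
    }
    if asg, ok := asgByInstanceID[instanceID]; ok && asg != "" {
        return ManagerASG, asg
    }
    return ManagerStandalone, ""
}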

Motivation

Once Karpenter is up, the user has to migrate workloads off the existing nodes onto the new Karpenter-managed ones. The right migration action depends on how each existing node is currently managed (scale the EKS managed node group to 0, delete the Fargate profile, stop the legacy cluster-autoscaler first, etc.). Today the install command finishes without leaving any record of the cluster's pre-Karpenter topology, forcing a follow-up migration tool (or the user) to re-discover it from scratch.

This PR persists the topology in a ConfigMap so the upcoming migration step has a ground truth to drive from. Tracked in CASCL-1304 under epic CASCL-920.

Additional Notes

  • Errors from the snapshot step are logged as warnings and do not fail the install — Karpenter is already up at that point and the snapshot is informational; failing the command would force a confusing rerun of the whole pipeline.
  • New IAM permission required on the operator user's identity: autoscaling:DescribeAutoScalingInstances (batched at the documented 50-ID per call limit; a batching sketch follows these notes).
  • New package cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo placed under common/ (not install/) so a future uninstall-side reader does not need to import the install package.
  • The clusterAutoscaler field is a sibling of nodeManagement so future cluster-level metadata can be added without breaking the schema (apiVersion: v1 is set explicitly to allow versioned evolution).
  • The migration step that consumes the ConfigMap is out of scope for this PR.
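
A rough sketch of the DescribeAutoScalingInstances batching mentioned above, using the aws-sdk-go-v2 autoscaling client; the function name and how the caller builds the instance-ID list are illustrative:

package clusterinfo

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

// asgNamesByInstanceID resolves which Auto Scaling group, if any, owns each
// EC2 instance, batching requests at the documented 50-ID-per-call limit of
// DescribeAutoScalingInstances. Instances that are not in any ASG are simply
// absent from the returned map.
func asgNamesByInstanceID(ctx context.Context, client *autoscaling.Client, instanceIDs []string) (map[string]string, error) {
    const maxIDsPerCall = 50
    result := make(map[string]string, len(instanceIDs))
    for start := 0; start < len(instanceIDs); start += maxIDsPerCall {
        end := start + maxIDsPerCall
        if end > len(instanceIDs) {
            end = len(instanceIDs)
        }
        resp, err := client.DescribeAutoScalingInstances(ctx, &autoscaling.DescribeAutoScalingInstancesInput{
            InstanceIds: instanceIDs[start:end],
        })
        if err != nil {
            return nil, err
        }
        for _, inst := range resp.AutoScalingInstances {
            result[aws.ToString(inst.InstanceId)] = aws.ToString(inst.AutoScalingGroupName)
        }
    }
    return result, nil
}

With 75 candidate nodes this makes two calls (50 + 25), which is what the ASG-batching unit test in the test plan below exercises.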

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: N/A (kubectl plugin only — no Agent/Cluster Agent code path is touched)
  • Cluster Agent: N/A

Describe your test plan

  • Unit tests under cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/ cover the empty-cluster, all-six-buckets, ASG-batching (75 nodes → 2 batches of 50+25), cluster-autoscaler detection by name / app.kubernetes.io/name label / k8s-app label / container-image substring, scaled-to-zero CA, and Persist create/update round-trips. Run with go test ./cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/....
  • Manual e2e on a real EKS cluster:
    make kubectl-datadog
    AWS_PROFILE=... AWS_REGION=eu-west-3 \
      kubectl datadog autoscaling cluster install --cluster-name <test-cluster>
    kubectl get cm -n dd-karpenter dd-cluster-info -o yaml
    Cross-check the snapshot against kubectl get nodes -o wide and aws eks list-nodegroups.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

Once Karpenter is installed on an existing EKS cluster, the user must
migrate workloads off the existing nodes. The right migration action
depends on how each node is currently managed (Fargate / Karpenter /
EKS managed node group / EC2 ASG / standalone) and whether a legacy
cluster-autoscaler is still running. Capture that topology in a
`dd-cluster-info` ConfigMap so the follow-up migration step has a
ground truth to drive from.

Errors are logged as a warning and do not fail `install`: the snapshot
is informational, and Karpenter is already up by the time it runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run `make update-golang` to align test/e2e/go.mod and go.sum with the
indirect aws-sdk-go-v2 / smithy-go patch bumps pulled in by `go work
sync`. Required by the GitLab `check-golang-version` job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 27, 2026

Here is the output of the manual e2e test:

$ kubectl -n dd-karpenter get cm/dd-cluster-info -o yaml
kind: ConfigMap
apiVersion: v1
data:
  cluster-info: |
    apiVersion: v1
    clusterName: lenaic-karpenter-test
    generatedAt: 2026-04-27T14:08:30.1565563Z
    nodeManagement:
        asg:
            eksctl-lenaic-karpenter-test-nodegroup-ng-bar-NodeGroup-awxH2qriZxFV:
                - ip-10-11-233-63.eu-west-3.compute.internal
        eksManagedNodeGroup:
            ng:
                - ip-10-11-232-138.eu-west-3.compute.internal
                - ip-10-11-234-250.eu-west-3.compute.internal
            ng-foo:
                - ip-10-11-234-249.eu-west-3.compute.internal
        fargate:
            "":
                - fargate-ip-10-11-234-123.eu-west-3.compute.internal
                - fargate-ip-10-11-234-246.eu-west-3.compute.internal
        karpenter:
            dd-karpenter-3qdyk:
                - ip-10-11-234-209.eu-west-3.compute.internal
                - ip-10-11-234-23.eu-west-3.compute.internal
                - ip-10-11-234-253.eu-west-3.compute.internal
                - ip-10-11-234-37.eu-west-3.compute.internal
    clusterAutoscaler:
        present: false
metadata:
  creationTimestamp: "2026-04-27T14:08:30Z"
  labels:
    app.kubernetes.io/managed-by: kubectl-datadog
  name: dd-cluster-info
  namespace: dd-karpenter
  resourceVersion: "1321418"
  uid: 18e5aec6-2a90-4c5d-96b4-57b2100521a9
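
For reference, the Persist side that produces the data field above could look roughly like this; the info argument stands for the ClusterInfo struct quoted later in the review, the controller-runtime client and the yaml.v3 marshaller are assumptions, and the real code apparently goes through the repo's commonk8s.CreateOrUpdate helper instead of the inline get/create/update below:

package clusterinfo

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "gopkg.in/yaml.v3"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Persist marshals the snapshot to YAML and stores it under the cluster-info
// key of the dd-cluster-info ConfigMap in the Karpenter namespace, creating
// or updating the ConfigMap as needed.
func Persist(ctx context.Context, cli client.Client, info any, namespace string) error {
    raw, err := yaml.Marshal(info)
    if err != nil {
        return err
    }
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "dd-cluster-info",
            Namespace: namespace,
            Labels:    map[string]string{"app.kubernetes.io/managed-by": "kubectl-datadog"},
        },
        Data: map[string]string{"cluster-info": string(raw)},
    }
    existing := &corev1.ConfigMap{}
    err = cli.Get(ctx, client.ObjectKeyFromObject(cm), existing)
    switch {
    case err == nil:
        existing.Labels = cm.Labels
        existing.Data = cm.Data
        return cli.Update(ctx, existing)
    case apierrors.IsNotFound(err):
        return cli.Create(ctx, cm)
    default:
        return err
    }
}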

The new clusterinfo package imports
`github.com/aws/aws-sdk-go-v2/service/autoscaling` for
DescribeAutoScalingInstances batching, so the third-party license
manifest needs the matching entry. Patch matches the diff produced by
the CI's `make verify-licenses` step verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 27, 2026

@codex review

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Apr 27, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 82.01%
Overall Coverage: 41.24% (+0.21%)

🔗 Commit SHA: ec85702

@codecov-commenter

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 82.55034% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.12%. Comparing base (d5f00bf) to head (ec85702).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...autoscaling/cluster/common/clusterinfo/classify.go 88.98% 7 Missing and 6 partials ⚠️
...ctl-datadog/autoscaling/cluster/install/install.go 0.00% 10 Missing ⚠️
.../autoscaling/cluster/common/clusterinfo/persist.go 90.00% 1 Missing and 1 partial ⚠️
...adog/autoscaling/cluster/common/clients/clients.go 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2945      +/-   ##
==========================================
+ Coverage   40.91%   41.12%   +0.21%     
==========================================
  Files         324      326       +2     
  Lines       28743    28894     +151     
==========================================
+ Hits        11760    11883     +123     
- Misses      16129    16150      +21     
- Partials      854      861       +7     
Flag Coverage Δ
unittests 41.12% <82.55%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...adog/autoscaling/cluster/common/clients/clients.go 12.71% <0.00%> (-0.11%) ⬇️
.../autoscaling/cluster/common/clusterinfo/persist.go 90.00% <90.00%> (ø)
...ctl-datadog/autoscaling/cluster/install/install.go 15.31% <0.00%> (-0.46%) ⬇️
...autoscaling/cluster/common/clusterinfo/classify.go 88.98% <88.98%> (ø)

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5f00bf...ec85702. Read the comment docs.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a5a25b67c


return err
}

if err = recordClusterInfo(ctx, cli, clusterName, karpenterNamespace); err != nil {

P2: Snapshot before creating installer-managed nodes

In the default --install-mode=fargate path, installHelmChart has already waited for the Karpenter controller before this call, and the Fargate stack selects the Karpenter namespace, so EKS can already have created the new dd-karpenter-<cluster> Fargate node/profile by the time the snapshot is taken. That makes the ConfigMap include installer-created Karpenter infrastructure as if it were pre-existing topology, and a migration reader using this snapshot to retire legacy Fargate profiles could target the profile that keeps Karpenter running; take the snapshot before installing/waiting for Karpenter or filter the installer-created profile.


Member Author

Good catch — fixed in 234ffa6.

Classify now runs before createCloudFormationStacks, so the Fargate profile and the controller node it eventually backs are not yet observable from the cluster. Persist stays at the end, where the Karpenter namespace exists. Both calls remain best-effort (warnings on failure).
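
The best-effort wiring in install could be as small as the following; recordClusterInfo is the call shown in the diff context above, while the argument types and the stderr warning are stand-ins for whatever the command actually uses:

import (
    "context"
    "fmt"
    "os"

    "sigs.k8s.io/controller-runtime/pkg/client"
)

func snapshotClusterInfo(ctx context.Context, cli client.Client, clusterName, karpenterNamespace string) {
    if err := recordClusterInfo(ctx, cli, clusterName, karpenterNamespace); err != nil {
        // Best effort: the snapshot is informational and Karpenter is already
        // up, so a failure is reported but never fails the install.
        fmt.Fprintf(os.Stderr, "warning: failed to record cluster info: %v\n", err)
    }
}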

@L3n41c L3n41c force-pushed the lenaic/CASCL-1304-list-node-manager branch from 234ffa6 to 6a5a25b on April 27, 2026 at 16:55
@L3n41c L3n41c marked this pull request as ready for review April 27, 2026 16:58
@L3n41c L3n41c requested a review from a team April 27, 2026 16:58
@L3n41c L3n41c requested a review from a team as a code owner April 27, 2026 16:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a5a25b67c


},
}

return commonk8s.CreateOrUpdate(ctx, cli, cm)

P2: Preserve the first cluster-info snapshot

When install is run again after a successful Fargate install, this CreateOrUpdate overwrites the existing dd-cluster-info ConfigMap. Since the Karpenter namespace/Fargate profile and controller node already exist before a rerun starts, the original pre-install topology can be replaced with installer-created nodes, so the follow-up migration may treat the dd-karpenter-* profile as legacy infrastructure. If the ConfigMap already exists, keep the original snapshot or filter installer-owned entries instead of updating it.


Member Author

Contesting this one. The "preserve the first snapshot" approach has the same fundamental problem in the more common scenario where the user re-runs install after a partial failure (IAM hiccup, kubeconfig issue, transient AWS error, …): the "first" snapshot would freeze whatever state existed during that partially-successful first run, which may itself include installer-managed nodes — and the user would then be stuck with a wrong snapshot they cannot refresh.

Both alternatives — snapshot-before-install (#234ffa64, since reverted) and preserve-first-snapshot — buy correctness on a perfectly clean first run at the cost of locking the snapshot in cases where iteration is needed. We are accepting the late-snapshot trade-off knowingly: the migration consumer will learn to skip Karpenter-managed nodes in the dd-karpenter-* Fargate profile and the karpenter bucket. Tracking that filter in the migration tool is a one-liner and survives reruns.

Not actioning.

// ClusterAutoscaler captures whether a legacy cluster-autoscaler Deployment
// is running and where, so the migration can warn the user to stop it before
// scaling EKS managed node groups (per the Karpenter migration guide).
type ClusterAutoscaler struct {
Collaborator

nits: maybe the version too.

Member Author

Done in ec85702. The version comes from the image tag of the matching container (the source of truth), with app.kubernetes.io/version on the Deployment / pod template as a fallback for digest-only image references. The field is omitted from the YAML when neither signal is available.
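
A sketch of that preference order, assuming the matching container spec and the Deployment/pod-template labels are already in hand; the function name is illustrative and the tag parsing is simplified compared to a full image-reference parser:

package clusterinfo

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// clusterAutoscalerVersion prefers the image tag of the matching container
// and falls back to the app.kubernetes.io/version label; it returns "" when
// neither is available, in which case the field is omitted from the YAML.
func clusterAutoscalerVersion(container corev1.Container, labels map[string]string) string {
    ref := container.Image
    // Strip any digest suffix (image:tag@sha256:... or image@sha256:...).
    if at := strings.Index(ref, "@"); at != -1 {
        ref = ref[:at]
    }
    // A trailing ":<tag>" is a tag as long as it is not actually the port of
    // a registry host (which would still be followed by a "/").
    if idx := strings.LastIndex(ref, ":"); idx != -1 && !strings.Contains(ref[idx:], "/") {
        return ref[idx+1:] // e.g. .../cluster-autoscaler:v1.30.0 -> v1.30.0
    }
    // Digest-only or untagged reference: fall back to the Helm chart label.
    return labels["app.kubernetes.io/version"]
}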

ClusterName string `yaml:"clusterName"`
GeneratedAt time.Time `yaml:"generatedAt"`
NodeManagement map[NodeManager]map[string][]string `yaml:"nodeManagement"`
ClusterAutoscaler ClusterAutoscaler `yaml:"clusterAutoscaler"`
Collaborator

should we also collect if karpenter or karpenter-auto-mode was already present?

Member Author

@L3n41c L3n41c Apr 28, 2026

We already collect if some nodes are managed by Karpenter.
So, we may miss the fact that Karpenter is up and running if it hasn’t had to create any node yet.
May I suggest adding the presence of a detected Karpenter in a follow-up PR, because I put the logic to detect “foreign” Karpenter (not installed by kubectl datadog) in another dedicated PR: #2949?

Per review feedback: the migration consumer needs to know which CA
version is running so it can branch on known-bad versions or surface
deprecation guidance. Capture the version, preferring the image tag
of the matching container (the source of truth) and falling back to
the `app.kubernetes.io/version` label set by most Helm charts. The
field is omitted from the YAML when neither signal is available
(e.g. an image referenced by digest only and no version label).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 28, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec857027f1


Comment thread cmd/kubectl-datadog/autoscaling/cluster/install/install.go
@L3n41c
Member Author

L3n41c commented Apr 28, 2026

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Apr 28, 2026

View all feedbacks in Devflow UI.

2026-04-28 13:39:41 UTC ℹ️ Start processing command /merge


2026-04-28 13:39:46 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 2h (p90).


2026-04-28 15:40:26 UTC MergeQueue: The build pipeline has timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

